Reduced-error pruning with significance tests
Author: Eibe Frank
Abstract
When building classification models, it is common practice to prune them to counter spurious effects of the training data: this often improves performance and reduces model size. "Reduced-error pruning" is a fast pruning procedure for decision trees that is known to produce small and accurate trees. Apart from the data from which the tree is grown, it uses an independent "pruning" set, and pruning decisions are based on the model's error rate on this fresh data. Recently it has been observed that reduced-error pruning overfits the pruning data, producing unnecessarily large decision trees. This paper investigates whether standard statistical significance tests can be used to counter this phenomenon. The problem of overfitting to the pruning set highlights the need for significance testing. We investigate two classes of test, "parametric" and "non-parametric." The standard chi-squared statistic can be used both in a parametric test and as the basis for a non-parametric permutation test. In both cases it is necessary to select the significance level at which pruning is applied. We show empirically that both versions of the chi-squared test perform equally well if their significance levels are adjusted appropriately. Using a collection of standard datasets, we show that significance testing improves on standard reduced-error pruning if the significance level is tailored to the particular dataset at hand using cross-validation, yielding consistently smaller trees that perform at least as well and sometimes better.
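To make the pruning criterion concrete, the following is a minimal sketch (in Python with NumPy and SciPy, not the authors' implementation) of how a chi-squared test on the pruning set can decide whether to replace a subtree with a leaf: a contingency table cross-tabulates the branch each pruning instance takes at a node against its class, and the p-value comes either from the chi-squared distribution (the parametric test) or from repeatedly permuting the class labels (the non-parametric permutation test). All function names, the smoothing in the permutation p-value, and the default significance level are illustrative assumptions.

import numpy as np
from scipy.stats import chi2

def contingency(branches, classes):
    # Branch-by-class counts for the pruning instances that reach a node.
    b_vals, b_idx = np.unique(branches, return_inverse=True)
    c_vals, c_idx = np.unique(classes, return_inverse=True)
    table = np.zeros((len(b_vals), len(c_vals)))
    np.add.at(table, (b_idx, c_idx), 1)
    return table

def chi_squared_statistic(table):
    # Pearson chi-squared statistic for a contingency table.
    table = np.asarray(table, dtype=float)
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / table.sum()
    mask = expected > 0  # skip empty rows/columns to avoid division by zero
    return (((table - expected) ** 2)[mask] / expected[mask]).sum()

def parametric_p_value(table):
    # P-value from the chi-squared distribution (the "parametric" test).
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    return chi2.sf(chi_squared_statistic(table), dof)

def permutation_p_value(branches, classes, n_permutations=1000, seed=0):
    # P-value from a permutation test (the "non-parametric" test): shuffle the
    # class labels and count how often the statistic is at least as extreme.
    rng = np.random.default_rng(seed)
    observed = chi_squared_statistic(contingency(branches, classes))
    hits = sum(
        chi_squared_statistic(contingency(branches, rng.permutation(classes))) >= observed
        for _ in range(n_permutations)
    )
    return (hits + 1) / (n_permutations + 1)

def should_prune(branches, classes, alpha=0.05, parametric=True):
    # Replace the subtree below a node with a leaf unless its split is
    # significantly associated with the class on the pruning data at level alpha.
    table = contingency(branches, classes)
    p = parametric_p_value(table) if parametric else permutation_p_value(branches, classes)
    return p > alpha

In line with the abstract, the significance level alpha would not be fixed at 0.05 but tailored to the dataset, for example by cross-validating a small grid of candidate levels and keeping the one with the lowest estimated error.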
Similar resources
Adjusting for Multiple Testing in Decision Tree Pruning
Overfitting is a widely observed pathology of induction algorithms. Overfitted models contain unnecessary structure that reflects nothing more than chance variations in the particular data sample used to construct the model. Portions of these models are literally wrong, and can mislead users. Overfitted models require more storage space and take longer to execute than their correctly-sized counterpa...
Full text
Connection pruning with static and adaptive pruning schedules
Neural network pruning methods on the level of individual network parameters (e.g. connection weights) can improve generalization, as is shown in this empirical study. However, an open problem in the pruning methods known today (e.g. OBD, OBS, autoprune, epsiprune) is the selection of the number of parameters to be removed in each pruning step (pruning strength). This work presents a pruning me...
Full text
Significance Regression: Improved Estimation from Collinear Data for the Measurement Error Model
This paper examines improved regression methods for the linear multivariable measurement error model (MEM) when the data suffers from collinearity. The difficulty collinearity presents for reliable estimation is discussed, and a systematic procedure, significance regression (SR-MEM), is developed to address collinearity. In addition to mitigating collinearity difficulties, SR-MEM produces asymptotically unbia...
Full text
Error Resilient Video Communications in Wireless ATM
A combined source coding, channel coding, and packetization scheme is proposed for high-performance video communications over wireless ATM. The three-dimensional significance-linked connected component analysis (3DSLCCA) source codec significantly reduces error propagation while maintaining high coding efficiency. Channel coding is implemented by using Reed-Solomon codes for both within-cell and inter...
Full text
Comparing Adaptive and Non-Adaptive Connection Pruning With Pure Early Stopping
Neural network pruning methods on the level of individual network parameters (e.g. connection weights) can improve generalization, as is shown in this empirical study. However, an open problem in the pruning methods known today (OBD, OBS, autoprune, epsiprune) is the selection of the number of parameters to be removed in each pruning step (pruning strength). This work presents a pruning method...
Full text